Astron. Nachr./AN 329, No. 3, 288-291 (2008) / DOI 10.1002/asna.200710943



Automated Probabilistic Classification of Transients and Variables 

Ashish Mahabal 1,*, S.G. Djorgovski 1, M. Turmon 2, J. Jewell 2, R.R. Williams 1, A.J. Drake 1,
M.G. Graham 1, C. Donalek 1, E. Glikman 1, and the Palomar-QUEST Team

1 California Institute of Technology, Pasadena, CA 91125, USA

2 Jet Propulsion Laboratory, Pasadena, CA, USA 



Received 2007 Sep 1, accepted 2007 Nov 27
Published online 2008 Feb 25

Key words Classification, Bayesian networks, Transients, Variables

There is an increasing number of large, digital, synoptic sky surveys, in which repeated observations are obtained over
large areas of the sky in multiple epochs. Likewise, there is a growth in the number of (often automated or robotic)
follow-up facilities with varied capabilities in terms of instruments, depth, cadence, wavelengths, etc., most of which are
geared toward some specific astrophysical phenomenon. As the number of detected transient events grows, an automated,
probabilistic classification of the detected variables and transients becomes increasingly important, so that an optimal use
can be made of follow-up facilities, without unnecessary duplication of effort. We describe a methodology now under
development for a prototype event classification system; it involves Bayesian and Machine Learning classifiers, automated
incorporation of feedback from follow-up observations, and discriminated or directed follow-up requests. This type of
methodology may be essential for the massive synoptic sky surveys in the future.

1 Introduction

Traditional practice of time-domain astronomy generally involves
targeted observing of small samples or even individual
instances of a particular type of variable objects or phenomena.
The recent advent of large, digital synoptic sky surveys
is now revolutionizing the field, thanks to the advancement
of computing power and detectors (CCDs). The field
has been moving towards a systematic exploration of larger
areas with a better time sampling and understanding of finer
details of many phenomena (e.g., GRBs, supernovae, variable
AGN, etc.). Many of the events, especially those that
vary on short time scales, need rapid follow-up for proper
understanding and scientific exploitation. This has resulted
in a number of robotic telescopes which can turn to a target
very quickly for such follow-ups.

A key link is the one between event producers (e.g., synoptic
surveys, GRB satellites, etc.) and event consumers, i.e., follow-up
facilities. The last few years have seen the emergence of
computer networks and protocols which can collect event streams
from large surveys and distribute them to facilities that can
go after interesting events. The synergy between the
Palomar-QUEST survey (http://palquest.org) and the VOEventNet
system (http://voeventnet.caltech.edu) for the distribution,
classification, and follow-up of events is one such example
(Djorgovski et al. 2006, 2007).

In this paper, we describe in more detail the ongoing
development of the automated event classification and follow-up
engine for this system. This experience and methodology
should be useful more broadly, for other synoptic sky surveys,
both existing and planned.

As more synoptic sky surveys come online, the problem 
is going to be one of plenty. On the one hand there will be 
too many events to follow up individually, and on the other 
hand not all follow-up facilities would be willing or capable 
of tracking all types of events due to constraints on bright- 
ness, wavelength, sky visibility, etc. More importantly, most 
follow-up facilities are generally interested in only specific 
types of objects or phenomena (owing to research interests, 
policy, funding, etc.). The most critical issue then is of clas- 
sifying events so that they can be matched up with facilities 
ready for them and without unnecessary duplication of ef- 
fort. We note that an automated event classification of any 
kind in time-domain astronomy has never been done; and it 
may well turn out to be the key enabling technology for the 
massive synoptic sky surveys of the future. 

Transient event classification is a very challenging prob- 
lem. A key difficulty is the sparsity of information initially 
available, e.g., position on the sky and magnitude in one or 
two bands at a couple of epochs. Incorporation of archival 
data and follow-up observations is essential. As the avail- 
able information increases, iterative classification is needed. 
One class of existing methodologies is Machine Learning 
(ML) based. It includes Support Vector Machines (SVMs), 
Artificial Neural Networks (ANNs) etc. On the other hand, 
Bayesian classifiers may be more powerful for these appli- 
cations, owing to the variable and incomplete nature of the 
data. Priors for distributions of observable event parameters 
can be formed for different types of objects and probabili- 
ties evaluated for each class. 

An important post-classification step is that of feedback 
based on actual follow-up, as it will help improve the priors 
and the resulting classifications. When classification prob- 
abilities are inconclusive, an intelligent follow-up request 
engine can suggest the best follow-up facility to serve as a 
tie-breaker between two or more competing classes. 

In the next sections we describe the Bayesian classifi- 
cation scheme, associated supervised learning schemes that 
can exploit known parameter dependencies better, revision 
of priors based on feedback, and the follow-up request en- 
gine. Throughout the discussion here, transients are treated 
as a special class of variables that are typically seen only 
once in a given survey (an operational definition), although 
they may have counterparts previously seen in other data. 

2 Methodology 

2.1 Bayesian Event Classification 

Consumers of transient events are usually interested only 
in particular kinds of sources, e.g., supernovae of a given 
type to be used either as cosmological standard candles, or 
as the probes of the endpoints of stellar evolution; GRB af- 
terglows; gravitational microlensing events, especially with 
the possible planetary signatures; flaring AGN; etc. Thus 
the desired output of a classification system is to evaluate 
a probability of any given event as belonging to each of 
the possible known classes. Self-imposed probability accep- 
tance cut-off can then allow individual consumers to decide 
if a particular event is worth following. The most interest- 
ing outcome may be the events which do not fit any of the 
known patterns and thus are possibly examples of new types 
of astronomical objects or phenomena. 
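
The acceptance cut-off described above can be illustrated with a minimal sketch. The class names, consumer names, and threshold values below are hypothetical and chosen only for illustration; the paper does not specify any of them.

```python
def interested_consumers(class_probs, cutoffs):
    """Return, in alphabetical order, the consumers whose class of interest
    reaches their own self-imposed probability cut-off."""
    return sorted(
        name
        for name, (cls, min_prob) in cutoffs.items()
        if class_probs.get(cls, 0.0) >= min_prob
    )


# Hypothetical per-class probabilities for a single event.
event_probs = {"SN Ia": 0.62, "CV": 0.25, "AGN flare": 0.13}

# Hypothetical consumers: (class of interest, acceptance cut-off).
consumer_cutoffs = {
    "sn_cosmology_group": ("SN Ia", 0.5),
    "cv_monitoring_net": ("CV", 0.4),
    "agn_monitor": ("AGN flare", 0.1),
}

selected = interested_consumers(event_probs, consumer_cutoffs)
```

Each consumer sets its own threshold, so the same set of probabilities can trigger follow-up at one facility and be ignored by another.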

Prior distributions need to be estimated for each type of 
variable astrophysical phenomena that we want to classify, 
even though a particular event classification is inevitably 
based on incomplete data. Then an estimated probability of 
a new event belonging to any given class can be evaluated 
from all pieces of information available. Such information 
in some format has already been collected by various groups 
for particular types of objects, e.g., the Supernova Typing 
Machine (http://wise-obs.tau.ac.il/~dovip/typing/) uses magnitudes
in different filters at different epochs to try to determine
the type of a given supernova. Another example
is the search for quasars in a particular redshift bin based
on certain broadband colors.

Fig. 1 A schematic illustration of the desired functionality of the
Bayesian Event Classification (BEC) engine. The input is generally
sparse discovery data, including brightness in various filters,
possibly the rate of change, position, possible motion, etc., and
measurements from available multi-wavelength archives; and a library
of priors giving probabilities for observing these particular
parameters if the event belongs to a class X. The output is an evolving
set of probabilities of belonging to various classes of interest.

A schematic of the Bayesian Event Classification (BEC)
engine is shown in Fig. 1. To take an example, denote the
feature vector of event parameters as x, and the object class
that gave rise to this vector as y, 1 ≤ y ≤ K. Many potential
entries within x will be unknown, as the information may be
incomplete. In practice, certain fields within x will almost
certainly be known, e.g., sky position, brightness in selected
filters, etc. However, other parameters will be known only
selectively: brightness change over various time baselines,
and object shape. It is because of the dominance of missing
values, as well as the abundance of prior information,
that a Bayesian classification methodology is likely to work
best, as has been demonstrated by its effectiveness in such
applications as document classification and patient diagnosis,
where there are many sparsely known attributes. In this
view, x and y are related via

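\[
\begin{aligned}
P(y = k \mid x) &= P(x \mid y = k)\, P(k) / P(x) \\
                &\propto P(k)\, P(x \mid y = k) \\
                &\approx P(k) \prod_{b=1}^{B} P(x_b \mid y = k) \qquad (1)
\end{aligned}
\]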
Because we are only interested in the above quantity as a
function of k, we can drop factors that depend only on x.
Furthermore, we have assumed that, conditional on the class
y, the feature vector decomposes into B roughly independent
blocks, generically labeled x_b. These blocks may be
singleton variables, or contain multiple variables, for example,
sets of highly correlated filters. The decoupling allows
us to (1) circumvent the curse of dimensionality, as the
decomposition keeps the dimensionality of each block manageable
(we will eventually have to learn the conditional
distributions P(x_b | y = k) for each k, and as more components
are added to x_b, more examples will be needed to learn the
corresponding distribution), and (2) cope easily with
missing variables by dropping the corresponding
factors from the product above. The methodology makes a
seemingly strong independence assumption, but in practice,
because we are after a classification and not an exact
membership probability, classification results can be excellent
even when the assumption is violated (Hand & Yu 2001;
Domingos & Pazzani 1997).
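
As an illustration of Eq. (1), the following minimal sketch evaluates class probabilities under the block-independence assumption and drops the factor of any block whose value is missing. It assumes, purely for illustration, that every block is a single scalar with a Gaussian conditional distribution; the class names, block names, and parameter values are hypothetical and not taken from the paper.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log of a 1-D Gaussian density (an assumed form for P(x_b | y = k))."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def classify(event, class_priors, block_models):
    """Return P(y = k | x) for each class k, following Eq. (1).

    event:        dict block name -> measured value, or None if missing
    class_priors: dict class -> P(k)
    block_models: dict class -> dict block name -> (mean, std)
    """
    log_post = {}
    for k, prior in class_priors.items():
        total = math.log(prior)
        for block, value in event.items():
            if value is None:
                continue  # missing block: drop its factor from the product
            mean, std = block_models[k][block]
            total += gaussian_logpdf(value, mean, std)
        log_post[k] = total
    # Normalise over classes; subtracting the peak avoids underflow.
    peak = max(log_post.values())
    weights = {k: math.exp(v - peak) for k, v in log_post.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


# Hypothetical two-class library of priors.
priors = {"SN": 0.3, "CV": 0.7}
models = {
    "SN": {"color": (0.0, 0.5), "delta_mag": (1.0, 0.5)},
    "CV": {"color": (0.5, 0.5), "delta_mag": (3.0, 1.0)},
}

# Only the color is known at discovery; the magnitude change is missing.
posterior = classify({"color": 0.0, "delta_mag": None}, priors, models)
```

When every block is missing, the result reduces to the class priors P(k), which is the expected behaviour for an event with no usable measurements.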

2.2 Machine Learning Event Classifier 

Besides the BEC, traditional supervised classification methods
such as ANNs and/or SVMs (Vapnik 1995; Cristianini
& Shawe-Taylor 2000; Fan, Chen & Lin 2005; Ripley
1996; Schölkopf & Smola 2001) can be used for confirmed
event databases with large, nonsparse training and
validation data sets, where the use of supervised networks
is already well established. Such a Machine Learning Event
Classifier (MLEC) can represent events as vectors of observable
parameters X = (x_1, ..., x_i, ..., x_n), where the x_i are
various quantities observed for the large majority of events,
e.g., flux amplitudes in various filters, coordinates, flux ratios,
etc.

Fig. 2 The Event Classification Engine, containing two separate
classifiers described in the text (BEC and MLEC), would provide
event classifications, and incorporate newly obtained data
from follow-up observations for improved, iterative event
classifications. Reliably classified events can be used for refinements of
the event classifiers, as well as for the operation of the Follow-up
Prioritization Engine (FPE).

Two types of problems can be expected. (1) Not all parameters
would be measured for all events. For example,
some may be missing a measurement in a particular filter,
due to a detector problem; some may be in an area of the
sky where there are no useful radio observations; etc. A partial
solution is to train a set of quasi-independent classifiers
and invoke the one best suited to the observations available.
(2) Many observables would be given as upper or lower
limits, rather than as well-defined measurements. This can
be partly addressed by treating them either as actual measurements
or as missing values, at the cost of inaccurate or lossy data. Thus,
this approach may be more useful for a classification of
variable (always present, but changing) sources, rather than
transients (detected only once). However, the performance
of the MLEC would improve steadily as more follow-up
observations accumulate. A schematic combining the BEC, MLEC and
the feedback stages is shown in Fig. 2.
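
The partial solution to problem (1), a set of quasi-independent classifiers indexed by the parameters they require, might be sketched as follows. The selection rule used here (prefer the applicable classifier that uses the most parameters) is an assumption made for illustration, as are the parameter names; the trained classifiers themselves are represented only by labels.

```python
def select_classifier(available, classifiers):
    """Pick the classifier that uses the most parameters among those whose
    required parameters are all measured for this event.

    available:   set of parameter names measured for the event
    classifiers: dict mapping a frozenset of required parameter names to a
                 trained classifier (any object; a label in this sketch)
    Returns (required_parameters, classifier), or None if none applies.
    """
    usable = [(req, clf) for req, clf in classifiers.items() if req <= available]
    if not usable:
        return None
    return max(usable, key=lambda item: len(item[0]))


# Hypothetical bank of classifiers trained on different parameter subsets.
classifiers = {
    frozenset({"g"}): "single_band",
    frozenset({"g", "r"}): "optical_only",
    frozenset({"g", "r", "radio"}): "optical_radio",
}

# An event in a sky area with no useful radio observations.
choice = select_classifier({"g", "r", "ra", "dec"}, classifiers)
```

An event lacking even the minimal parameter set yields `None`, signalling that no trained classifier applies and that the event should be left to the BEC or flagged for follow-up.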

2.3 Feedback Incorporation 

A crucial feature of the system should be the ability to update
and revise the prior distributions on the basis of the actual
performance, as we accumulate the true physical classifications
of events, e.g., on the basis of follow-up spectroscopy.
Learning, in the Bayesian view, is precisely the action
of determining the probability models above; once determined,
the overall model (1) can be used to answer many
relevant questions about the events. Analytically, we formulate
this as determining unknown distributional parameters
θ in parameterized versions of the conditional distributions
above, P(x | y = k; θ). (Of course, the parameters depend
on the object class k, but we suppress this below.) In a histogram
representation, θ is just the probabilities associated
with each bin, which may be determined by computing the
histogram itself. In a Gaussian representation, θ would be
the mean vector μ and covariance matrix Σ of a multivariate
Gaussian distribution, and the parameter estimates are
just the corresponding mean and covariance of the object-k
data. When enough data are available we can adopt a
semi-parametric representation in which the distribution is a linear
superposition of such Gaussian distributions:



\[
P(x_d \mid y = k; \theta) = \sum_{m=1}^{M} w_m \, N(x_d; \mu_m, \Sigma_m)
\]



This generalizes the Gaussian representation, since by increasing
M, more distributional characteristics may be accounted
for. The corresponding parameters may be chosen
by the Expectation-Maximization algorithm (Turmon, Pap
& Mukhtar 2002) or kernel density estimation (Silverman
1986; John & Langley 1995). Three possible sources of information
can be used to find the unknown parameters: (1)
background physical knowledge, e.g., from considerations
of monotonicity, (2) examples labeled by experts, (3) feedback
from the downstream observatories once labels are determined.
The first case gives an analytical form for the distribution,
but the last two provide labeled examples, (x, y),
which can be used to select a set of K probability distributions
(one per class) as described above. The parallel performance of the
Bayesian and Machine Learning event classification engines
can be evaluated and compared, and the output of both used,
unless one turns out to be clearly superior to the other.
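
A minimal sketch of the histogram representation follows, in which θ is the set of per-bin probabilities for one scalar block and each labeled example (x, y) from expert labeling or follow-up feedback updates the counts for its class. The bin edges, the pseudo-count used for smoothing, and the class label are assumptions made for illustration only.

```python
from collections import defaultdict


class HistogramPrior:
    """Per-class histogram estimate of P(x_b | y = k) for one scalar block."""

    def __init__(self, edges, pseudo_count=1.0):
        self.edges = list(edges)          # ascending bin edges
        self.pseudo_count = pseudo_count  # assumed smoothing for empty bins
        n_bins = len(self.edges) - 1
        self.counts = defaultdict(lambda: [0.0] * n_bins)

    def _bin(self, x):
        """Index of the bin containing x, or None if x is out of range."""
        for i in range(len(self.edges) - 1):
            if self.edges[i] <= x < self.edges[i + 1]:
                return i
        return None

    def add_feedback(self, x, label):
        """Incorporate one labeled example (x, y) from follow-up."""
        i = self._bin(x)
        if i is not None:
            self.counts[label][i] += 1.0

    def prob(self, x, label):
        """Smoothed probability of the bin containing x, given the class."""
        i = self._bin(x)
        if i is None:
            return 0.0
        counts = self.counts[label]
        total = sum(counts) + self.pseudo_count * len(counts)
        return (counts[i] + self.pseudo_count) / total


prior = HistogramPrior(edges=[0.0, 1.0, 2.0, 3.0])
before = prior.prob(0.5, "SN")            # uniform before any feedback
for _ in range(3):
    prior.add_feedback(0.5, "SN")         # three confirmed events in bin 0
after = prior.prob(0.5, "SN")
```

With no feedback the estimate is uniform over the bins; as confirmed classifications accumulate, the bin probabilities approach the observed frequencies, which is the sense in which feedback revises the priors.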

2.4 Follow-up Prioritization Engine 

Fig. 3 A schematic illustration of follow-up observation recommendations.
At left, the initial estimated per-class probabilities for
eight object classes, showing high entropy resulting from ambiguity
between the object classes numbered 1, 6, and 7. Follow-up
observations from two telescopes are possible (center). Their resolving
capacity is shown as a function of class y (left axis) and
observed value (right axis, parallel to the green arrows). In the diagram,
for telescope 1, as the observed value x_a moves up the green arrow,
class 6 becomes increasingly preferred. For telescope 2, moderate
values (near the crossbar in the arrow) indicate class 6, and other
values indicate class 7. Finally, at right, typical updated classifications.
The lower-entropy classification at the top is preferred. Since
the particular values used for refinement (x_a, x_b) are unknown at
decision time, appropriate averages of entropy must be used, as
described in the text.

The sparse data can often lead to ambiguous classifications,
or to no meaningful classification at all.
On such occasions a follow-up prioritization engine
can suggest the best follow-up strategy to reduce confusion
between competing classes. For example, the system
may decide that obtaining an optical light curve with a particular
time cadence would discriminate between a supernova
and a quasar, or that a particular color measurement
would discriminate between a cataclysmic variable eruption
and a gravitational microlensing event, etc. Suitably prioritized
requests for the needed follow-up observations would
be generated and sent to the appropriate telescopes. Since
observational resources are scarce, it is important to rank-order
the possible follow-up observations according to which
ones result in the greatest reduction in classification uncertainty.
This can be done using an information-theoretic approach
(Loredo & Chernoff 2003) by quantifying the classification
uncertainty using the conditional entropy of the
posterior for y, given all the available data. When an additional
observation, x_+, is taken, the entropy decreases from
H(y | x_0) to H(y | x_0, x_+). This is illustrated in Fig. 3, where
the original classification p(y | x_0) is ambiguous and may be
refined in one of two ways. The refinement for particular
observations x_a versus x_b is shown. The correct choice is
the one that will reduce the final entropy most. In our notation,
the best follow-up observation is the one which results
in the minimal final entropy, given by

\[
x_+^{*} = \arg\min_{x_+} H(y \mid x_+, x_0), \qquad
H(y \mid x_+, x_0) = -\sum_{y,\, x_+} p(y, x_+ \mid x_0) \log p(y \mid x_+, x_0)
\]

In computing this, we average over all possible values of
the new measurement x_+ and class y. Note that this is equivalent
to maximizing the conditional mutual information of
x_+ about y, given x_0; that is, I(y; x_+ | x_0) (Cover & Thomas
1991). The joint density above is known within the context
of our assumed statistical model. Specifically, we have a
joint probability of the form:



\[
p(y, x_+ \mid x_0) = \frac{p(x_+, x_0 \mid y)\, p(y)}{\sum_{y,\, x_+} p(x_+, x_0 \mid y)\, p(y)}
\]



where the right hand side is given by factors as in (1). The 
conditional probability, 



\[
p(y \mid x_+, x_0) = \frac{p(x_+, x_0 \mid y)\, p(y)}{\sum_{y} p(x_+, x_0 \mid y)\, p(y)}
\]



is the Bayes posterior given the new and previous measurements.
Therefore, we can compute, within the context of the
previously learned statistical model used to define the posterior
in (1), the follow-up measurement resulting in the greatest
entropy reduction given the previous measurements. We
can thus provide a rank-ordered list of potential follow-up
observations according to the information-theoretic ranking,
leading to the most efficient use of the resources.
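
The ranking described above can be sketched for the simplest case of discrete observation outcomes. This is an illustrative sketch only: it assumes that the new measurement x_+ is conditionally independent of x_0 given the class y (as in the factorization of Eq. (1)), and the classes, facilities, outcomes, and probabilities below are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_posterior_entropy(current, likelihood):
    """Entropy of p(y | x_+, x_0) averaged over the outcomes of x_+.

    current:    dict class -> p(y | x_0)
    likelihood: dict class -> dict outcome -> p(x_+ | y)
    """
    outcomes = next(iter(likelihood.values())).keys()
    total = 0.0
    for outcome in outcomes:
        joint = {y: current[y] * likelihood[y][outcome] for y in current}
        p_outcome = sum(joint.values())  # p(x_+ | x_0)
        if p_outcome == 0.0:
            continue
        posterior = [v / p_outcome for v in joint.values()]
        total += p_outcome * entropy(posterior)
    return total


def rank_followups(current, facilities):
    """Order candidate observations by increasing expected final entropy."""
    scored = [
        (expected_posterior_entropy(current, lik), name)
        for name, lik in facilities.items()
    ]
    return [name for _, name in sorted(scored)]


# Hypothetical ambiguous classification between two classes.
current = {"SN": 0.5, "QSO": 0.5}

# Hypothetical follow-up options and their outcome likelihoods.
facilities = {
    "color": {
        "SN": {"blue": 0.5, "red": 0.5},
        "QSO": {"blue": 0.5, "red": 0.5},
    },
    "light_curve": {
        "SN": {"fading": 0.9, "flat": 0.1},
        "QSO": {"fading": 0.2, "flat": 0.8},
    },
}

ranking = rank_followups(current, facilities)
```

In this toy case the color measurement has the same outcome distribution for both classes, so it leaves the entropy unchanged, while the light curve is expected to reduce it and is therefore ranked first.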

3 Summary 

We have presented a software methodology, now under de- 
velopment, for an automated, iterative probabilistic classifi- 
cation of variables and transients found in large digital syn- 
optic sky surveys. Our primary approach is using Bayesian 
networks, with a parallel development of classifiers based 
on Machine Learning techniques. Incorporation of feedback 
from follow-up observations is essential, both to update the 
Bayesian priors, and to improve the training data sets for the 
ML algorithms. Another innovation is an engine to discriminate
between possible follow-up facilities for optimal results
as well as a faster rate of learning. Experimental
implementations of this methodology with existing surveys such
as Palomar-QUEST (PQ) should both enhance their scientific returns, and help
lay the groundwork for the more ambitious projects in the
future, such as Pan-STARRS and LSST.